About the project

I heard that this R course is useful for handling research data quite efficiently, so my interest was guaranteed, although it will take some time to become more familiar with this programming language. I first learned of the course's value through an acquaintance.

Aim of the course

I am looking forward to challenging myself with this completely new topic and to finding tools for analyzing my research data. More about RStudio you can find from here

Where to find this project

You can find my project here: https://github.com/GH-Club/IODS-project


Exercise 2

1. Reading the data and exploring the structure

students2014 <- read.csv(file = "/Users/streetman/IODS-project/data/lrn14.csv", header = T, sep=",")
students2014 <- students2014[,-1]
str(students2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
dim(students2014)
## [1] 166   7

Commentary

The data is a subset of a larger survey dataset that combines students' answers to questions about approaches to learning with some demographic variables. The subset consists of 166 rows (observations) and 7 columns (variables): gender, Age, Attitude, deep (deep-learning questions), stra (strategic-learning questions), surf (surface-learning questions), and Points.

2. Graphical overview

library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggpairs(students2014, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))

Commentary

Based on the visualization we can see, for example, that there were almost twice as many female respondents (N = 110) as male, that ages varied between 17 and 55, and that the mean age was 25. Of the correlations, the strongest is between Attitude and Points and the weakest between deep and Points.
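The figures cited above can be verified directly from the data; a quick sketch, assuming `students2014` is loaded as in step 1:

```r
# Gender counts and age distribution (the N = 110 and mean-age figures above)
table(students2014$gender)
summary(students2014$Age)

# Correlations of the numeric variables with Points
round(cor(students2014[sapply(students2014, is.numeric)])["Points", ], 2)
```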

3. Regression model and fitting variables

## Fit a linear model
model_fit <- lm(Points ~ Attitude + stra + surf, data = students2014)

## Print out a summary of the model
summary (model_fit)
## 
## Call:
## lm(formula = Points ~ Attitude + stra + surf, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## Attitude     0.33952    0.05741   5.913 1.93e-08 ***
## stra         0.85313    0.54159   1.575  0.11716    
## surf        -0.58607    0.80138  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08
## Fit a linear model
model_fit <- lm(Points ~ Attitude + stra, data = students2014)

## Print out a summary of the model
summary (model_fit)
## 
## Call:
## lm(formula = Points ~ Attitude + stra, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.97290    2.39591   3.745  0.00025 ***
## Attitude     0.34658    0.05652   6.132 6.31e-09 ***
## stra         0.91365    0.53447   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09
## Fit a linear model
model_fit <- lm(Points ~ Attitude, data = students2014)

## Print out a summary of the model
summary (model_fit)
## 
## Call:
## lm(formula = Points ~ Attitude, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Commentary

The target variable is Points. Of the chosen explanatory variables Attitude, stra and surf, neither stra nor surf shows a statistically significant relationship with the target variable (p = 0.117 and p = 0.466), so surf, with the higher p-value, was removed first.

In the refitted model, stra still shows no statistically significant relationship with the target variable (p = 0.089), so it was removed as well.

Finally, Attitude alone has a statistically significant relationship with Points (p < 0.001), with a multiple R-squared of 0.1906, which means that this variable explains approximately 19% of the variation in the points achieved by students.

4. Diagnostic plots (Residuals vs. Fitted values, Normal Q-Q plot and Residuals vs Leverage)

my_model2 <- lm(Points ~ Attitude, data = students2014)

plot(my_model2, which = c(1,2,5))

Commentary

A residual is the difference between the observed and the predicted value at a given point. "Residuals vs. Fitted": the assumption that the errors do not depend on the explanatory variable holds, since the points are evenly spread without any clear pattern. "Normal Q-Q": the residuals are assumed to be normally distributed, and the plot supports this. "Residuals vs. Leverage": no single observation has an unusually high impact on the model.
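These visual checks can also be complemented numerically; a minimal sketch, assuming `my_model2` has been fitted as above:

```r
# Shapiro-Wilk test of residual normality (complements the Q-Q plot)
shapiro.test(residuals(my_model2))

# Largest leverage (hat) values, to flag potentially influential observations
head(sort(hatvalues(my_model2), decreasing = TRUE))
```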


Exercise 3

1. Getting and reading the .csv file

getwd()
## [1] "/Users/streetman/IODS-project"
alc <- read.csv(file = "/Users/streetman/IODS-project/data/alc.csv", header = T, sep=",")
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
colnames(alc)
##  [1] "X"          "school"     "sex"        "age"        "address"   
##  [6] "famsize"    "Pstatus"    "Medu"       "Fedu"       "Mjob"      
## [11] "Fjob"       "reason"     "nursery"    "internet"   "guardian"  
## [16] "traveltime" "studytime"  "failures"   "schoolsup"  "famsup"    
## [21] "paid"       "activities" "higher"     "romantic"   "famrel"    
## [26] "freetime"   "goout"      "Dalc"       "Walc"       "health"    
## [31] "absences"   "G1"         "G2"         "G3"         "alc_use"   
## [36] "high_use"

Commentary

The dataset includes the variables listed above, which record students' answers to questions related to their alcohol use.

2. Relationships between high and low alcohol consumption, and choosing the parameters for this exploration

Choosing the parameters

library(tidyr)
glimpse(alc)
## Observations: 382
## Variables: 36
## $ X          <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ school     <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, G…
## $ sex        <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, F,…
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 1…
## $ address    <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,…
## $ famsize    <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, GT3…
## $ Pstatus    <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, T,…
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3,…
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3,…
## $ Mjob       <fct> at_home, at_home, at_home, health, other, services, o…
## $ Fjob       <fct> teacher, other, other, services, other, other, other,…
## $ reason     <fct> course, course, other, home, home, reputation, home, …
## $ nursery    <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, yes,…
## $ internet   <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes, y…
## $ guardian   <fct> mother, father, mother, mother, father, mother, mothe…
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3,…
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2,…
## $ failures   <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ schoolsup  <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, no…
## $ famsup     <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes, y…
## $ paid       <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, no…
## $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes, y…
## $ higher     <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes…
## $ romantic   <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no, …
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5,…
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3,…
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2,…
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1,…
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4,…
## $ absences   <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3,…
## $ G1         <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 1…
## $ G2         <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, …
## $ G3         <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12,…
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5…
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE…

Commentary

These four variables were chosen: 1. age (numeric) 2. famsize (binary: less than or equal to 3, or greater than 3) 3. failures (numeric: 0–3, with 4 denoting anything more) 4. absences (numeric: 0 to 93)

Hypothesis: there is a correlation between high alcohol consumption and each of the chosen variables when they are analyzed univariately. H0: there is no correlation between the chosen variable and high alcohol consumption. H1: there is a significant correlation between the chosen variable and high alcohol consumption.
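One simple way to examine these univariate hypotheses is with standard significance tests; a sketch, assuming the `alc` data frame from step 1:

```r
# Two-sample t-tests for the numeric variables against the binary target
t.test(age ~ high_use, data = alc)
t.test(failures ~ high_use, data = alc)
t.test(absences ~ high_use, data = alc)

# Chi-squared test of independence for the binary famsize variable
chisq.test(table(alc$famsize, alc$high_use))
```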

3. Distributions between the chosen variables above and the relations to alcohol consumption

Numerical overview of data

summary(alc)
##        X          school   sex          age        address famsize  
##  Min.   :  1.00   GP:342   F:198   Min.   :15.00   R: 81   GT3:278  
##  1st Qu.: 96.25   MS: 40   M:184   1st Qu.:16.00   U:301   LE3:104  
##  Median :191.50                    Median :17.00                    
##  Mean   :191.50                    Mean   :16.59                    
##  3rd Qu.:286.75                    3rd Qu.:17.00                    
##  Max.   :382.00                    Max.   :22.00                    
##  Pstatus      Medu            Fedu             Mjob           Fjob    
##  A: 38   Min.   :0.000   Min.   :0.000   at_home : 53   at_home : 16  
##  T:344   1st Qu.:2.000   1st Qu.:2.000   health  : 33   health  : 17  
##          Median :3.000   Median :3.000   other   :138   other   :211  
##          Mean   :2.806   Mean   :2.565   services: 96   services:107  
##          3rd Qu.:4.000   3rd Qu.:4.000   teacher : 62   teacher : 31  
##          Max.   :4.000   Max.   :4.000                                
##         reason    nursery   internet    guardian     traveltime   
##  course    :140   no : 72   no : 58   father: 91   Min.   :1.000  
##  home      :110   yes:310   yes:324   mother:275   1st Qu.:1.000  
##  other     : 34                       other : 16   Median :1.000  
##  reputation: 98                                    Mean   :1.448  
##                                                    3rd Qu.:2.000  
##                                                    Max.   :4.000  
##    studytime        failures      schoolsup famsup     paid     activities
##  Min.   :1.000   Min.   :0.0000   no :331   no :144   no :205   no :181   
##  1st Qu.:1.000   1st Qu.:0.0000   yes: 51   yes:238   yes:177   yes:201   
##  Median :2.000   Median :0.0000                                           
##  Mean   :2.037   Mean   :0.2016                                           
##  3rd Qu.:2.000   3rd Qu.:0.0000                                           
##  Max.   :4.000   Max.   :3.0000                                           
##  higher    romantic      famrel         freetime        goout      
##  no : 18   no :261   Min.   :1.000   Min.   :1.00   Min.   :1.000  
##  yes:364   yes:121   1st Qu.:4.000   1st Qu.:3.00   1st Qu.:2.000  
##                      Median :4.000   Median :3.00   Median :3.000  
##                      Mean   :3.937   Mean   :3.22   Mean   :3.113  
##                      3rd Qu.:5.000   3rd Qu.:4.00   3rd Qu.:4.000  
##                      Max.   :5.000   Max.   :5.00   Max.   :5.000  
##       Dalc            Walc           health         absences   
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   : 0.0  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:3.000   1st Qu.: 1.0  
##  Median :1.000   Median :2.000   Median :4.000   Median : 3.0  
##  Mean   :1.482   Mean   :2.296   Mean   :3.573   Mean   : 4.5  
##  3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:5.000   3rd Qu.: 6.0  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :45.0  
##        G1              G2              G3           alc_use     
##  Min.   : 2.00   Min.   : 4.00   Min.   : 0.00   Min.   :1.000  
##  1st Qu.:10.00   1st Qu.:10.00   1st Qu.:10.00   1st Qu.:1.000  
##  Median :12.00   Median :12.00   Median :12.00   Median :1.500  
##  Mean   :11.49   Mean   :11.47   Mean   :11.46   Mean   :1.889  
##  3rd Qu.:14.00   3rd Qu.:14.00   3rd Qu.:14.00   3rd Qu.:2.500  
##  Max.   :18.00   Max.   :18.00   Max.   :18.00   Max.   :5.000  
##   high_use      
##  Mode :logical  
##  FALSE:268      
##  TRUE :114      
##                 
##                 
## 

Commentary

For example, for the variable “absences” the mean is 4.5 and the maximum is 45. The range is wide, but since the 3rd quartile is 6.0, most values fall between 0 and 6.0.

Graphical overview of data

Distribution of the chosen variables

library(tidyr); library(dplyr); library(ggplot2)
alc_selected <- select(alc, one_of(c("age", "famsize", "failures", "absences")))
gather(alc_selected) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped

Commentary

For example, for the variable “age” we can see that most participants are under 19 years old.

Variables vs. alcohol usage by histograms

Age

g1 <- ggplot(alc, aes(x = high_use, y = age))
g1 + geom_boxplot(aes(col=sex)) + ylab("age") + ggtitle("Students alcohol consumption and age")

Commentary

Those with high alcohol consumption were mainly male, and their age range is wider than that of the females in this consumption category.

Famsize

g1 <- ggplot(data = alc, aes(x = alc_use, fill = famsize))
g1 + geom_bar()

Commentary

It seems that the higher the alcohol consumption, the more equal the two family-size groups become, whereas at lower consumption levels larger families (GT3) are considerably more common.

Failures

alc$failures <-  as.factor(alc$failures)
g1 <- ggplot(alc, aes(x = high_use, fill = failures))
g1 + geom_bar()

Commentary

Somewhat surprisingly, the number of failures is higher in the lower alcohol consumption group.

Absences

g1 <- ggplot(alc, aes(x = high_use, y = absences))
g1 + geom_boxplot(aes(col=sex)) + ylab("absences") + ggtitle("Students alcohol consumption and absences")

Commentary

It seems that the more absences a student has, the more likely high alcohol consumption is, for both sexes.

4. Logistic regression model of the chosen variables

Fitting the LR model

## Fit a linear model
alc$failures <-  as.numeric(alc$failures)
alc$famsize <-  as.numeric(alc$famsize)
model_fit <- lm(high_use ~ age + famsize + failures + absences, data = alc)

## Print out a summary of the model
summary (model_fit)
## 
## Call:
## lm(formula = high_use ~ age + famsize + failures + absences, 
##     data = alc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9850 -0.2883 -0.2132  0.5052  0.8587 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.399649   0.327648  -1.220  0.22332    
## age          0.023963   0.019765   1.212  0.22612    
## famsize      0.075069   0.050854   1.476  0.14073    
## failures     0.106463   0.039562   2.691  0.00744 ** 
## absences     0.017151   0.004184   4.099 5.09e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4419 on 377 degrees of freedom
## Multiple R-squared:  0.07954,    Adjusted R-squared:  0.06977 
## F-statistic: 8.144 on 4 and 377 DF,  p-value: 2.625e-06
## Fit a linear model
alc$failures <-  as.numeric(alc$failures)
model_fit <- lm(high_use ~ age + failures + absences, data = alc)

## Print out a summary of the model
summary (model_fit)
## 
## Call:
## lm(formula = high_use ~ age + failures + absences, data = alc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0151 -0.2787 -0.2162  0.5361  0.8393 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.308391   0.322265  -0.957  0.33920    
## age          0.024304   0.019795   1.228  0.22029    
## failures     0.104492   0.039601   2.639  0.00867 ** 
## absences     0.017366   0.004188   4.146 4.17e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4426 on 378 degrees of freedom
## Multiple R-squared:  0.07422,    Adjusted R-squared:  0.06687 
## F-statistic:  10.1 on 3 and 378 DF,  p-value: 2.045e-06
## Fit a linear model
alc$failures <-  as.numeric(alc$failures)
alc$famsize <-  as.numeric(alc$famsize)
model_fit <- lm(high_use ~ failures + absences, data = alc)

## Print out a summary of the model
summary (model_fit)
## 
## Call:
## lm(formula = high_use ~ failures + absences, data = alc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0035 -0.2801 -0.2127  0.5510  0.8053 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.081634   0.054308   1.503  0.13363    
## failures    0.113115   0.038999   2.900  0.00394 ** 
## absences    0.017973   0.004162   4.319 2.01e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4429 on 379 degrees of freedom
## Multiple R-squared:  0.07052,    Adjusted R-squared:  0.06562 
## F-statistic: 14.38 on 2 and 379 DF,  p-value: 9.573e-07

Commentary

The target variable is high_use (of alcohol). Of the chosen explanatory variables age, family size, failures and absences, neither age nor family size shows a statistically significant relationship with the target variable (p = 0.226 and p = 0.141), so family size was removed first.

In the refitted model, age still shows no statistically significant relationship with the target variable (p = 0.220), so it was removed as well.

Finally, failures and absences both have statistically significant relationships with high_use (p = 0.004 and p < 0.001), with a multiple R-squared of 0.07, which means that these variables explain approximately 7% of the variation in high alcohol use among the participants.

Odds ratios and CIs

### Fitting the logistic regressio model
m <- glm(high_use ~ failures + absences, data = alc, family = "binomial")

### Compute odds ratios (OR)
OR <- coef(m) %>% exp

### Compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
### Printing the output
cbind(OR, CI)
##                    OR      2.5 %    97.5 %
## (Intercept) 0.1528951 0.08788853 0.2604241
## failures    1.6482547 1.14889905 2.3830880
## absences    1.0898257 1.04433337 1.1423489

Commentary

The ORs of both variables, failures and absences, indicate higher odds of high_use (of alcohol); neither confidence interval crosses 1 (failures: 1.15–2.38, absences: 1.04–1.14), so both associations appear statistically significant. Of the two, absences has the narrower confidence interval and the smaller p-value, so it seems to be the variable most strongly related to high_use (of alcohol).

5. Predictive power of LR model

## Fit the model
m <- glm(high_use ~ absences, data = alc, family = "binomial")

## Predict() the probability of high_use
probabilities <- predict(m, type = "response")

## Add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

## Use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

## Tabulate the target variable versus the predictions
results <- table(high_use = alc$high_use, prediction = alc$prediction)

results
##         prediction
## high_use FALSE TRUE
##    FALSE   263    5
##    TRUE    105    9
tab <- table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()

tab
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.68848168 0.01308901 0.70157068
##    TRUE  0.27486911 0.02356021 0.29842932
##    Sum   0.96335079 0.03664921 1.00000000
## Total proportion of inaccurately classified individuals
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2879581

Commentary

The LR analysis above indicated that the variable “absences” has the strongest statistical relationship with high_use (of alcohol), so it was chosen here to test the predictive power of the LR model.

The computed value of 0.2879 indicates that approximately 29% of this model's predictions are incorrect and 71% correct. Note, however, that always predicting FALSE would already give an error rate of 114/382 ≈ 0.298 (see the table above), so the model's predictive power is modest.
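The error rate above is computed on the same data the model was fitted to; cross-validation gives a less optimistic estimate. A sketch using `boot::cv.glm` with the same loss function, assuming `m`, `alc` and `loss_func` from above:

```r
library(boot)

# 10-fold cross-validation of the logistic model, using loss_func as the cost
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]  # average proportion of misclassified observations across folds
```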


Exercise 4

1. Loading the data and exploring it

install.packages("MASS")

install.packages("corrplot")

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(dplyr)
library(tidyr)
library(ggplot2)
library(GGally)
library(corrplot)
## corrplot 0.84 loaded

Loading

data("Boston")

Exploring

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

Commentary

The dataset has 14 columns (variables) and 506 rows (observations). The data concerns housing values in the suburbs of Boston. More about the variables you can find here

2. Graphical overview and summaries

Summary

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Correlations

cor_matrix <- cor(Boston)

cor_matrix %>% round(2)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
"VISUALIZING the correlations"
## [1] "VISUALIZING the correlations"
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

Commentary

The crime rate seems to have its strongest positive correlations with the property tax rate (tax) and accessibility to radial highways (rad).
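This can be checked by listing the correlations of `crim` with the other variables in order; a quick sketch using the `cor_matrix` computed above:

```r
# Correlations with the crime rate, strongest positive first
# (column 1 is crim itself, so it is dropped)
sort(round(cor_matrix["crim", -1], 2), decreasing = TRUE)
```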

3. Standardize

boston_scaled <- scale(Boston)

summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

Commentary

We standardized the data by scaling it so that the variables become directly comparable: each now has mean 0 and standard deviation 1.
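`scale()` subtracts each column's mean and divides by its standard deviation; a quick equivalence check on one column (a sketch, using `crim`):

```r
# Manual standardization of one column matches the scale() output
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(as.numeric(boston_scaled[, "crim"]), manual)  # should return TRUE
```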

Creating categorical variable

boston_scaled <- as.data.frame(boston_scaled)

bins <- quantile(boston_scaled$crim)

crime <- cut(boston_scaled$crim, breaks = bins, labels = c("low","med_low","med_high","high"), include.lowest = TRUE)

boston_scaled <- dplyr::select(boston_scaled, -crim)

boston_scaled <- data.frame(boston_scaled, crime)

nrow(boston_scaled)
## [1] 506
n <- nrow(boston_scaled)

ind <- sample(n,  size = n * 0.8)

train <- boston_scaled[ind,]

test <- boston_scaled[-ind,]

correct_classes <- test$crime

test <- dplyr::select(test, -crime)

summary(test)
##        zn                indus               chas         
##  Min.   :-0.487240   Min.   :-1.42220   Min.   :-0.27233  
##  1st Qu.:-0.487240   1st Qu.:-0.86210   1st Qu.:-0.27233  
##  Median :-0.487240   Median :-0.21089   Median :-0.27233  
##  Mean   : 0.008159   Mean   : 0.04924   Mean   : 0.03646  
##  3rd Qu.: 0.209513   3rd Qu.: 1.01499   3rd Qu.:-0.27233  
##  Max.   : 3.586088   Max.   : 2.42017   Max.   : 3.66477  
##       nox                 rm                age           
##  Min.   :-1.42991   Min.   :-3.44659   Min.   :-2.222999  
##  1st Qu.:-0.96024   1st Qu.:-0.58123   1st Qu.:-0.814417  
##  Median :-0.14407   Median :-0.07918   Median : 0.317068  
##  Mean   :-0.03968   Mean   : 0.06373   Mean   : 0.009251  
##  3rd Qu.: 0.59809   3rd Qu.: 0.60754   3rd Qu.: 0.897020  
##  Max.   : 2.72965   Max.   : 3.55153   Max.   : 1.116390  
##       dis                rad                 tax          
##  Min.   :-1.17464   Min.   :-0.981871   Min.   :-1.31269  
##  1st Qu.:-0.82567   1st Qu.:-0.637331   1st Qu.:-0.77572  
##  Median :-0.19537   Median :-0.522484   Median :-0.39895  
##  Mean   : 0.03516   Mean   : 0.003333   Mean   : 0.02471  
##  3rd Qu.: 0.55661   3rd Qu.: 1.659603   3rd Qu.: 1.52941  
##  Max.   : 3.95660   Max.   : 1.659603   Max.   : 1.79642  
##     ptratio             black              lstat         
##  Min.   :-2.51994   Min.   :-3.72665   Min.   :-1.52961  
##  1st Qu.:-0.75315   1st Qu.: 0.22700   1st Qu.:-0.75067  
##  Median : 0.11292   Median : 0.39368   Median :-0.29170  
##  Mean   :-0.03244   Mean   : 0.09837   Mean   :-0.01283  
##  3rd Qu.: 0.80578   3rd Qu.: 0.44062   3rd Qu.: 0.63393  
##  Max.   : 1.26768   Max.   : 0.44062   Max.   : 2.70785  
##       medv         
##  Min.   :-1.66713  
##  1st Qu.:-0.63148  
##  Median :-0.15035  
##  Mean   : 0.05986  
##  3rd Qu.: 0.39058  
##  Max.   : 2.98650

Commentary

We created a categorical crime rate variable using the quantile breaks indicated above, and divided the data into training (80 %) and test (20 %) sets for fitting and validating the LDA (linear discriminant analysis) model.
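The categorical variable was created in the data wrangling step; a minimal sketch of the idea, using a simulated stand-in vector (scaled_crim below is not the actual Boston data):

```r
# Sketch of creating the categorical crime variable with quantile breaks
# (scaled_crim is a simulated stand-in for the scaled crim column)
set.seed(1)
scaled_crim <- rnorm(506)

bins <- quantile(scaled_crim)
crime <- cut(scaled_crim, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
table(crime)  # roughly a quarter of the observations in each class
```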

4. Fitting the linear discriminant analysis model

lda.fit <- lda(crime ~ ., data = train)

lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2376238 0.2574257 0.2500000 0.2549505 
## 
## Group means:
##                  zn      indus         chas        nox          rm
## low       1.0411511 -0.9586192 -0.108283225 -0.8662085  0.46064529
## med_low  -0.1139205 -0.3422746 -0.007331936 -0.5788835 -0.15673511
## med_high -0.3836558  0.1766433  0.117482837  0.3997728  0.04318968
## high     -0.4872402  1.0170891 -0.042983423  1.0391280 -0.37654422
##                 age        dis        rad        tax     ptratio
## low      -0.8969392  0.8418206 -0.6863802 -0.7208332 -0.42693157
## med_low  -0.3762243  0.3805142 -0.5523004 -0.5271986 -0.06295901
## med_high  0.4084842 -0.3704369 -0.4531215 -0.3411977 -0.29319053
## high      0.8061459 -0.8403926  1.6384176  1.5142626  0.78111358
##               black       lstat        medv
## low       0.3726995 -0.79330536  0.53328674
## med_low   0.3206780 -0.15021277 -0.01213996
## med_high  0.1079413  0.01084318  0.13799712
## high     -0.8744200  0.89313159 -0.67938128
## 
## Coefficients of linear discriminants:
##                 LD1           LD2         LD3
## zn       0.06051850  0.8886752341 -0.95796121
## indus    0.13171351 -0.3876117205  0.33691394
## chas    -0.02283016 -0.0040069457  0.12751008
## nox      0.33985474 -0.6145645311 -1.33837578
## rm       0.08064396 -0.0580078405 -0.14349424
## age      0.21750725 -0.3428648511 -0.18757385
## dis     -0.02077333 -0.4596329599  0.31658896
## rad      3.57518072  0.9711249633  0.24593269
## tax      0.11780862 -0.1060161167  0.24891951
## ptratio  0.12453220  0.0350012204 -0.31904844
## black   -0.05683481 -0.0004998502  0.07551261
## lstat    0.21038037 -0.2878402466  0.34365478
## medv     0.07644718 -0.5088738221 -0.25129282
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9543 0.0338 0.0119
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

classes <- as.numeric(train$crime)

plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 1)

Commentary

Here we fitted the LDA model to the training dataset. The plot shows that the crime rate groups are reasonably distinct, with some overlap between them; the high crime class separates most clearly, driven by the rad variable (the longest arrow).

5. Predicting the classes and cross tabulation

lda.pred <- predict(lda.fit, newdata = test)

table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       11      18        2    0
##   med_low    5       7       10    0
##   med_high   2       6       14    3
##   high       0       0        0   24

Commentary

Here we applied the fitted LDA model to the test dataset. The cross tabulation shows that the high crime class is predicted almost perfectly, while the lower classes are confused with their neighbours more often. Not an optimal prediction model, but fairly good.
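A quick way to summarize the table is the overall accuracy, i.e. the share of correct classifications on the diagonal; recomputed here directly from the values printed above:

```r
# Overall prediction accuracy from the cross tabulation above
conf <- matrix(c(11, 18,  2,  0,
                  5,  7, 10,  0,
                  2,  6, 14,  3,
                  0,  0,  0, 24),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))
accuracy <- sum(diag(conf)) / sum(conf)
accuracy  # 56/102, i.e. about 0.55
```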

6. Reload and standardize

data(Boston)

boston_scaled_2 <- scale(Boston)

str(boston_scaled_2)
##  num [1:506, 1:14] -0.419 -0.417 -0.417 -0.416 -0.412 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:506] "1" "2" "3" "4" ...
##   ..$ : chr [1:14] "crim" "zn" "indus" "chas" ...
##  - attr(*, "scaled:center")= Named num [1:14] 3.6135 11.3636 11.1368 0.0692 0.5547 ...
##   ..- attr(*, "names")= chr [1:14] "crim" "zn" "indus" "chas" ...
##  - attr(*, "scaled:scale")= Named num [1:14] 8.602 23.322 6.86 0.254 0.116 ...
##   ..- attr(*, "names")= chr [1:14] "crim" "zn" "indus" "chas" ...
dim(boston_scaled_2)
## [1] 506  14

Distance matrix

dist_eu <- dist(boston_scaled_2)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970

Distance matrix (Manhattan method)

dist_man <- dist(boston_scaled_2, method = "manhattan")

summary(dist_man)

Clustering

km <- kmeans(boston_scaled_2, centers = 3)

pairs(boston_scaled_2, col = km$cluster)

Determining the optimal number of clusters

k_max <- 10

twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled_2, k)$tot.withinss})

qplot(x = 1:k_max, y = twcss, geom = 'line')

km <- kmeans(boston_scaled_2, centers = 2)

pairs(boston_scaled_2, col = km$cluster)

Commentary

Here we reloaded and re-standardized the data and ran k-means clustering. Based on the total within-cluster sum of squares (WCSS) curve above, two clusters is the optimal choice, since the curve drops most sharply at that point.
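The same elbow heuristic can be sketched on a built-in dataset; iris here is only a stand-in for the scaled Boston data:

```r
# Elbow heuristic: total within-cluster sum of squares (WCSS) for k = 1..10
d <- scale(iris[, 1:4])
set.seed(123)
twcss_demo <- sapply(1:10, function(k) kmeans(d, centers = k, nstart = 10)$tot.withinss)
plot(1:10, twcss_demo, type = "b", xlab = "k", ylab = "total WCSS")
# the k at which the curve bends most sharply is the candidate number of clusters
```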


Exercise 5

library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
library(GGally)
library(corrplot)
library(tidyr)
library(ggplot2)

1. Getting and reading the .csv file

getwd()
## [1] "/Users/streetman/IODS-project"
human <- read.csv(file = "/Users/streetman/IODS-project/data/human.csv", header = T, sep=",")
dim(human)
## [1] 155   9
str(human)
## 'data.frame':    155 obs. of  9 variables:
##  $ X        : Factor w/ 155 levels "Afghanistan",..: 105 6 134 41 101 54 67 149 27 102 ...
##  $ Edu2.FM  : num  97.4 94.3 95 95.5 87.7 96.3 80.5 95.1 100 95 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ GNI      : Factor w/ 155 levels "1,123","1,228",..: 132 109 124 112 113 111 103 123 108 93 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
summary(human)
##            X          Edu2.FM          Labo.FM          Life.Exp    
##  Afghanistan:  1   Min.   :  0.90   Min.   :0.1857   Min.   :49.00  
##  Albania    :  1   1st Qu.: 27.15   1st Qu.:0.5984   1st Qu.:66.30  
##  Algeria    :  1   Median : 56.60   Median :0.7535   Median :74.20  
##  Argentina  :  1   Mean   : 55.37   Mean   :0.7074   Mean   :71.65  
##  Armenia    :  1   3rd Qu.: 85.15   3rd Qu.:0.8535   3rd Qu.:77.25  
##  Australia  :  1   Max.   :100.00   Max.   :1.0380   Max.   :83.50  
##  (Other)    :149                                                    
##     Edu.Exp           GNI         Mat.Mor         Ado.Birth     
##  Min.   : 5.40   1,123  :  1   Min.   :   1.0   Min.   :  0.60  
##  1st Qu.:11.25   1,228  :  1   1st Qu.:  11.5   1st Qu.: 12.65  
##  Median :13.50   1,428  :  1   Median :  49.0   Median : 33.60  
##  Mean   :13.18   1,458  :  1   Mean   : 149.1   Mean   : 47.16  
##  3rd Qu.:15.20   1,507  :  1   3rd Qu.: 190.0   3rd Qu.: 71.95  
##  Max.   :20.20   1,583  :  1   Max.   :1100.0   Max.   :204.80  
##                  (Other):149                                    
##     Parli.F     
##  Min.   : 0.00  
##  1st Qu.:12.40  
##  Median :19.30  
##  Mean   :20.91  
##  3rd Qu.:27.95  
##  Max.   :57.50  
## 

Commentary

The dataset comes from the United Nations Development Programme's Human Development reports. The variables are, in the order they appear in the data:

Ratio of females to males with at least secondary education (Edu2.FM)
Ratio of females to males in the labour force (Labo.FM)
Life expectancy at birth (Life.Exp)
Expected years of schooling (Edu.Exp)
Gross National Income per capita (GNI)
Maternal mortality ratio (Mat.Mor)
Adolescent birth rate (Ado.Birth)
Percentage of female representatives in parliament (Parli.F)

We can see, for example, that expected years of schooling ranges from a minimum of 5.40 years to a maximum of 20.20, with a mean of 13.18 years. Life expectancy ranges from 49 years to 83.5, with a mean of approximately 72 years.

2. Graphical overview

human$GNI <- as.integer(human$GNI) # note: on a factor this returns the factor level codes; as.numeric(gsub(",", "", as.character(human$GNI))) would recover the actual GNI values

human_ <- human[,-1]

dim(human_)
## [1] 155   8
str(human_)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  97.4 94.3 95 95.5 87.7 96.3 80.5 95.1 100 95 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ GNI      : int  132 109 124 112 113 111 103 123 108 93 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
ggpairs(human_)

cor(human_) %>% corrplot(method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)

Commentary

The plot matrix shows the distributions of the variables and their pairwise correlations.

The correlation plot summarizes the same pairwise correlations in a more readable form; the size and colour of each circle indicate the strength and sign of the correlation.
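A small sketch of how the strongest pairwise correlation could be located programmatically; the toy data frame below is hypothetical, not human_:

```r
# Find the most strongly correlated variable pair in a correlation matrix
cm <- cor(data.frame(a = 1:10, b = (1:10)^2, c = rev(1:10)))
cm_off <- cm
diag(cm_off) <- 0                 # ignore the trivial self-correlations
idx <- which(abs(cm_off) == max(abs(cm_off)), arr.ind = TRUE)[1, ]
rownames(cm)[idx]                 # here "a" and "c", which are perfectly (negatively) correlated
```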

3. PCA

pca_human <- prcomp(human_)

s <- summary(pca_human)

pca_pr <- round(100*s$importance[2, ], digits = 1)

pca_pr
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 93.4  4.1  1.5  0.7  0.3  0.0  0.0  0.0
biplot(pca_human, choices = 1:2, cex=c(0.8, 1), col=c("grey40", "deeppink2"))

Commentary

The table indicates that the first principal component (PC1) explains 93.4 % of the variance, so without standardization the data essentially collapses onto a single dimension.

The biplot shows that PC1 captures most of the between-country variation, with maternal mortality, the variable with the largest variance on the original scale, as the driving component.
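The variance percentages printed above come from the squared standard deviations of the components; the same computation is sketched here on a built-in stand-in dataset:

```r
# Proportion of variance explained by each PC (unstandardized data, as above)
pca <- prcomp(USArrests)
round(100 * pca$sdev^2 / sum(pca$sdev^2), 1)
# as with the human data, the variable with the largest raw variance dominates PC1
```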

4. Standardize

human_std <- scale(human_)
summary(human_std)
##     Edu2.FM            Labo.FM           Life.Exp          Edu.Exp       
##  Min.   :-1.76122   Min.   :-2.6247   Min.   :-2.7188   Min.   :-2.7378  
##  1st Qu.:-0.91250   1st Qu.:-0.5484   1st Qu.:-0.6425   1st Qu.:-0.6782  
##  Median : 0.03967   Median : 0.2316   Median : 0.3056   Median : 0.1140  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.96275   3rd Qu.: 0.7350   3rd Qu.: 0.6717   3rd Qu.: 0.7126  
##  Max.   : 1.44288   Max.   : 1.6632   Max.   : 1.4218   Max.   : 2.4730  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-1.7154   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.8577   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median : 0.0000   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.8577   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 1.7154   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
pca_human <- prcomp(human_std)

s <- summary(pca_human)
pca_pr <- round(100*s$importance[2, ], digits = 1)
pca_pr
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 50.7 16.5 12.3  9.7  3.8  2.9  2.6  1.4
biplot(pca_human, choices = 1:2, cex=c(0.8, 1), col=c("grey40", "deeppink2"))

Commentary

GNI, Edu2.FM, Edu.Exp and Life.Exp have negative weights on PC1.

Mat.Mor and Ado.Birth have positive weights on PC1.

Labo.FM has the most positive effect on PC2.

5. Interpretation

Variables related to e.g. women's education, income and life expectancy explain most of the differences between countries, since they have the largest weights on PC1.

The variable related to women's labour force participation (Labo.FM) contributes the most to PC2.

6. Loading the Tea dataset

library(FactoMineR)

data(tea)

str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36
library(dplyr)

keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")

# select the 'keep_columns' to create a new dataset
tea_time <- dplyr::select(tea, one_of(keep_columns))

# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x =  element_text(angle = 45, hjust = 1, size = 8))

MCA

mca <- MCA(tea_time, graph = FALSE)

summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.279   0.261   0.219   0.189   0.177   0.156
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.144   0.141   0.117   0.087   0.062
## % of var.              7.841   7.705   6.392   4.724   3.385
## Cumulative % of var.  77.794  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898
##                       cos2  v.test     Dim.3     ctr    cos2  v.test  
## black                0.003   0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            0.027   2.867 |   0.433   9.160   0.338  10.053 |
## green                0.107  -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone                0.127  -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                0.035   3.226 |   1.329  14.771   0.218   8.081 |
## milk                 0.020   2.422 |   0.013   0.003   0.000   0.116 |
## other                0.102   5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag              0.161  -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged   0.478  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged           0.141  -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
plot(mca, invisible=c("ind"), habillage = "quali")

Commentary

The graph shows associations between the categories. The explanatory percentage of variation for each dimension is shown in parentheses on the axis labels.

Dimension 1 and Dimension 2 explain 15.2 % and 14.2 % of the variation, respectively.
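The percentages can be recomputed from the eigenvalues printed above (up to rounding of the printed values):

```r
# Percentage and cumulative percentage of variance from the MCA eigenvalues above
vars <- c(0.279, 0.261, 0.219, 0.189, 0.177, 0.156, 0.144, 0.141, 0.117, 0.087, 0.062)
round(100 * vars / sum(vars), 1)          # first two: about 15.2 and 14.2
round(cumsum(100 * vars / sum(vars)), 1)  # cumulative % of variance
```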


Exercise 6

1. Dataset “RATS”

Loading and exploring

library(ggplot2)
library(tidyr)
library(dplyr)

ratsl <- read.csv("/Users/streetman/IODS-project/data/RATSL.csv", row.names = 1)

ratsl$Group <- factor(ratsl$Group)
ratsl$ID <- factor(ratsl$ID)

ratsl <- ratsl %>% group_by(Time) %>% mutate(stweight = (Weight - mean(Weight)) / sd(Weight)) %>% ungroup()

str(ratsl)
## Classes 'tbl_df', 'tbl' and 'data.frame':    176 obs. of  6 variables:
##  $ ID      : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group   : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ WD      : Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Weight  : int  240 225 245 260 255 260 275 245 410 405 ...
##  $ Time    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ stweight: num  -1.001 -1.12 -0.961 -0.842 -0.882 ...

Commentary

The data contain 16 rats divided into 3 groups on different diets, with body weight measured repeatedly over time.

Graphics

ggplot(ratsl, aes(x = Time, y = Weight, linetype = ID)) + geom_line() + facet_grid(. ~ Group, labeller = label_both) + scale_linetype_manual(values = rep(1:6, times=4)) + theme(legend.position = "none") +   scale_y_continuous(limits = c(min(ratsl$Weight), max(ratsl$Weight)))

Commentary

All rats gain weight over time. Group 1 rats are clearly lighter than the others at every time point, and there is also variation between the subjects within each group.

Summarizing the dataset

n <- ratsl$Time %>% unique() %>% length()

ratssum <- ratsl %>% group_by(Group, Time) %>% summarise(mean = mean(Weight), se = ( sd(Weight) / sqrt(n) )) %>% ungroup()

glimpse(ratssum)
## Observations: 33
## Variables: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ Time  <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29…
## $ mean  <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.…
## $ se    <dbl> 4.589478, 3.947710, 3.460116, 4.100800, 3.333956, 3.552939…
par(mfrow = c(1,2)) # note: par() has no effect on ggplot2 figures, so the two plots below are drawn separately

ggplot(ratssum, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1,2,3)) +
  geom_point(size=3) +
  scale_shape_manual(values = c(1,2,3)) +
  theme(legend.position = c(0.9,0.5)) +
  scale_y_continuous(name = "mean(Weight) +/- se(Weight)")

ggplot(ratssum, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1,2,3)) +
  geom_point(size=3) +
  scale_shape_manual(values = c(1,2,3)) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
  theme(legend.position = c(0.9,0.5)) +
  scale_y_continuous(name = "mean(Weight) +/- se(Weight)")

Commentary

Over time, the mean weights increase in all three groups.
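The first row of the summary table can be verified by hand from the weights shown in the str() output. Note that the code above divides by sqrt(n) where n is the number of time points (11), not the group size (8); the conventional standard error of the mean would divide by the group size:

```r
# Group 1 weights at Time 1, taken from the str() output above
w <- c(240, 225, 245, 260, 255, 260, 275, 245)
mean(w)                   # 250.625, matching the first mean in ratssum
sd(w) / sqrt(11)          # about 4.589, matching the first se (n = number of time points)
sd(w) / sqrt(length(w))   # conventional standard error of the mean
```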

Box-plotting

ggplot(ratsl, aes(x = factor(Time), y = Weight, fill = Group)) + geom_boxplot()

Summary measures (visualizing and excluding outliers)

rat8s <- ratsl %>%
  filter(Time > 0) %>%
  group_by(Group, ID) %>%
  summarise( mean=mean(Weight) ) %>%
  ungroup()

glimpse(rat8s)
## Observations: 16
## Variables: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean  <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.7273…
ggplot(rat8s, aes(x=Group, y = mean)) + geom_boxplot() + stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white")

glimpse(rat8s)
## Observations: 16
## Variables: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean  <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.7273…
rat8s1 <- filter(rat8s, Group == 2 & mean < 550 | Group == 1 & mean > 260 | Group == 3 & mean > 500)

glimpse(rat8s1)
## Observations: 13
## Variables: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3
## $ ID    <fct> 1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 14, 15, 16
## $ mean  <dbl> 261.0909, 260.1818, 266.5455, 269.4545, 274.7273, 274.6364…
ggplot(rat8s1, aes(x=Group, y = mean)) + geom_boxplot() + stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white")

Comparing means

summary(aov(mean ~ Group, data = rat8s1))
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Group        2 175958   87979    2577 2.72e-14 ***
## Residuals   10    341      34                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Commentary

Using ANOVA on the summary measures, with the outliers excluded, the test indicates a highly significant difference between the group means (p < 0.001).

kruskal.test(mean ~ Group, data = rat8s1)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  mean by Group
## Kruskal-Wallis chi-squared = 9.8901, df = 2, p-value = 0.007119

Commentary

The non-parametric Kruskal–Wallis test also indicates a significant difference between the groups (p ≈ 0.007).

More analysis (adding the baseline from the original dataset)

rats <- read.csv("/Users/streetman/IODS-project/data/rats.csv", row.names = 1)

rat8s2 <- rat8s %>% mutate(baseline = rats$WD1)

glimpse(rat8s2)
## Observations: 16
## Variables: 4
## $ Group    <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID       <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean     <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.7…
## $ baseline <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, …
fit <- lm(mean ~ baseline + Group, data = rat8s2)
summary(fit)
## 
## Call:
## lm(formula = mean ~ baseline + Group, data = rat8s2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.732  -3.812   1.991   6.889  13.455 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.14886   19.88779   1.516   0.1554    
## baseline     0.93194    0.07793  11.959 5.02e-08 ***
## Group2      31.68866   17.11189   1.852   0.0888 .  
## Group3      21.52296   21.13931   1.018   0.3287    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.62 on 12 degrees of freedom
## Multiple R-squared:  0.9947, Adjusted R-squared:  0.9933 
## F-statistic: 747.8 on 3 and 12 DF,  p-value: 6.636e-14

Commentary

The baseline weight strongly predicts the mean weight in the later phase; after controlling for the baseline, however, there is no significant difference between the group means.

2. Dataset “BPRS”

Loading

bprsl <- read.csv(file = "/Users/streetman/IODS-project/data/bprsl.csv", row.names = 1)

bprsl$treatment <- factor(bprsl$treatment)

bprsl$subject <- factor(bprsl$subject)

glimpse(bprsl)
## Observations: 360
## Variables: 4
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject   <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ weeks     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ bprs      <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38…

Graphical

ggplot(bprsl, aes(x = weeks, y = bprs, group = subject)) + geom_line() + facet_grid(. ~ treatment, labeller = label_both) 

Commentary

BPRS scores decrease over time for most individuals in both treatment groups.

Linear regression

bps_reg <- lm(bprs ~ weeks + treatment, data=bprsl)
summary(bps_reg)
## 
## Call:
## lm(formula = bprs ~ weeks + treatment, data = bprsl)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## weeks        -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

Commentary

The linear regression of BPRS on week and treatment group indicates that BPRS decreases significantly over time, while the treatment variable is not significant. Note that this model ignores the repeated measurements on each subject.

RIM (Random Intercept Model)

library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
bps_ref <- lmer(bprs ~ weeks + treatment + (1|subject), data = bprsl, REML=FALSE)

summary(bps_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ weeks + treatment + (1 | subject)
##    Data: bprsl
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## weeks        -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) weeks 
## weeks      -0.437       
## treatment2 -0.282  0.000

Commentary

The random intercept model allows each subject's fitted line to differ in intercept, which accounts for the within-subject correlation of the repeated measurements that the ordinary regression ignored.

RIM and RSM (Random Slope Model)

bps_ref2 <- lmer(bprs ~ weeks + treatment + (weeks|subject), data = bprsl, REML=FALSE)

summary(bps_ref2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ weeks + treatment + (weeks | subject)
##    Data: bprsl
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7977 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           weeks        0.9609  0.9803   -0.51
##  Residual             97.4304  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## weeks        -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) weeks 
## weeks      -0.582       
## treatment2 -0.247  0.000

Analyzing with ANOVA

anova(bps_ref, bps_ref2)
## Data: bprsl
## Models:
## bps_ref: bprs ~ weeks + treatment + (1 | subject)
## bps_ref2: bprs ~ weeks + treatment + (weeks | subject)
##          Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)  
## bps_ref   5 2748.7 2768.1 -1369.4   2738.7                           
## bps_ref2  7 2745.4 2772.6 -1365.7   2731.4 7.2721      2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Commentary

Based on the likelihood ratio test, the random intercept and slope model fits the data better (p ≈ 0.026).
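The chi-square statistic in the ANOVA table is simply the difference of the two deviances (twice the difference in log-likelihood); a quick check against the rounded values printed above:

```r
# Likelihood ratio test by hand, using the (rounded) deviances printed above
dev_rim <- 2738.7   # deviance of the random intercept model
dev_rsm <- 2731.4   # deviance of the random intercept + slope model
chisq   <- dev_rim - dev_rsm                # about 7.3, close to the reported 7.2721
pchisq(chisq, df = 2, lower.tail = FALSE)   # about 0.026, matching Pr(>Chisq)
```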

RIM and RSM (interaction included)

bps_ref3 <- lmer(bprs ~ weeks * treatment + (weeks|subject), data = bprsl, REML=FALSE)

summary(bps_ref3)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ weeks * treatment + (weeks | subject)
##    Data: bprsl
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0512 -0.6271 -0.0768  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.9964  8.0620        
##           weeks        0.9687  0.9842   -0.51
##  Residual             96.4707  9.8220        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                  Estimate Std. Error t value
## (Intercept)       47.8856     2.2521  21.262
## weeks             -2.6283     0.3589  -7.323
## treatment2        -2.2911     1.9090  -1.200
## weeks:treatment2   0.7158     0.4010   1.785
## 
## Correlation of Fixed Effects:
##             (Intr) weeks  trtmn2
## weeks       -0.650              
## treatment2  -0.424  0.469       
## wks:trtmnt2  0.356 -0.559 -0.840

Commentary

The interaction estimate (weeks:treatment2 = 0.72) suggests that BPRS decreases slightly more slowly in treatment group 2 than in group 1, but the interaction is not significant at the 5 % level.

anova(bps_ref2, bps_ref3)
## Data: bprsl
## Models:
## bps_ref2: bprs ~ weeks + treatment + (weeks | subject)
## bps_ref3: bprs ~ weeks * treatment + (weeks | subject)
##          Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)  
## bps_ref2  7 2745.4 2772.6 -1365.7   2731.4                           
## bps_ref3  8 2744.3 2775.4 -1364.1   2728.3 3.1712      1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Commentary

The interaction model does not significantly improve the fit (p ≈ 0.075), so the model without the interaction is retained.

Fitting and plotting

Fitted <- fitted(bps_ref2)

bprsl <- bprsl %>% mutate(Fitted)

p1 <- ggplot(bprsl, aes(x = weeks, y = bprs, group = subject)) + geom_line() + facet_grid(. ~ treatment, labeller = label_both) + scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 1)) +  ylim(0,100) + theme(legend.position = "top") + ggtitle("Observed")

p2 <- ggplot(bprsl, aes(x = weeks, y = Fitted, group = subject)) +
      geom_line() + facet_grid(. ~ treatment, labeller = label_both) +
      scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 1)) +
      ylim(0, 100) +
      theme(legend.position = "top") + ggtitle("Fitted")
    
p1

p2